Analysis of the Content and Style Representations

In the paper Image Style Transfer Using Convolutional Neural Networks, Gatys et al. showed that image style transfer can be done by using a pre-trained convolutional neural network (VGG19) to extract the content and style representations of a content image and a style image, and then using those representations to create a new image (see the paper or this Udacity lab for more details)

I performed an experiment applying style transfer in another notebook, StyleTransfer_Experiment_Results.ipynb. In this notebook, I focus on gaining a better understanding of the Content Representation and Style Representations of the images used in that experiment

In short, based on the findings described in the paper, we can extract the Content Representation and Style Representations of an image by performing a forward pass through the pre-trained VGG19 network and retrieving the representations as follows:

  • Content Representation: Perform a forward pass with the content image through the network. Then, retrieve the output of the convolutional layer conv4_2 and use it as the content representation
  • Style Representations: Perform a forward pass with the style image through the network. Then,
    • retrieve the outputs of the convolutional layers conv1_1, conv2_1, conv3_1, conv4_1, conv5_1
    • for each layer, compute a Gram Matrix and use it as a style representation (it captures which prominent styles are shared across the layer's feature maps)

To recap the experiment in the other notebook, StyleTransfer_Experiment_Results.ipynb: I performed the style transfer process using a content image, a photo of a god statue in the Grand Palace, Bangkok, Thailand, and a style image, a piece of Thai art, to create a new image as shown below:

In [1]:
%matplotlib inline
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format = 'retina'


import torch
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image

from src.StyleTransfer import StyleTransfer
from src.internal import im_convert
In [2]:
content_image = 'images/GodStatute_GrandPalace_Bangkok.jpg'
style_image = 'images/mah_tip.jpg'

st = StyleTransfer(content_image=content_image, style_image=style_image)
st.showInputImages()
In [3]:
print('Target Image:')
target_im = Image.open('output/god_statue_small_lr.png')
target_im
Target Image:
Out[3]:

Analysis of the Content Representation

To visualize the Content Representation of the content image, we will inspect the output of the conv4_2 layer when forward-passing the content image through the VGG19 network, as shown in the picture below:

Note that this operation has already been done when constructing the StyleTransfer object, and the tensor data of the convolutional layers of interest can be retrieved through the object's data property

In [4]:
# Inspect available data stored in the object
st.data.keys()
Out[4]:
dict_keys(['tensor_content', 'tensor_style', 'tensor_target', 'features_content', 'features_style', 'style_grams'])
In [5]:
# Then, retrieve the output of the 'conv4_2' layer
feat_content_all = st.data['features_content']
feat_content = feat_content_all['conv4_2']
feat_content.size()
Out[5]:
torch.Size([1, 512, 85, 64])
In [6]:
# Remove the batch dimension since we don't need it
feat_content = feat_content.squeeze()
feat_content.size()
Out[6]:
torch.Size([512, 85, 64])

There are a total of 512 filters in the conv4_2 layer. As you can see in the images below, each filter detects a different object and shape arrangement in the content image, and we will use the outputs of those filters as our Content Representation in the style transfer process

In [7]:
# Inspect an output of each filter via 
st.showContentRepresentation()
***** Content Representation *****

Analysis of the Style Representations

To visualize the Style Representations of the style image, we will compute a Gram Matrix from the outputs of the conv1_1, conv2_1, conv3_1, conv4_1, and conv5_1 layers when forward-passing the style image through the VGG19 network, as shown in the picture below:

Below is an illustration of how to compute a Gram Matrix for a particular convolutional layer:
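The illustrated computation can be written out in a few lines of PyTorch: flatten each of the layer's d feature maps into a row of a (d, h*w) matrix and multiply it by its own transpose. This is a standalone sketch, not the StyleTransfer internals:

```python
import torch

def gram_matrix(tensor):
    """Gram matrix of a (1, d, h, w) or (d, h, w) feature tensor."""
    t = tensor.squeeze(0) if tensor.dim() == 4 else tensor
    d, h, w = t.size()
    flat = t.view(d, h * w)   # one row per feature map
    return flat @ flat.t()    # entry (i, j) = dot product of feature maps i and j

features = torch.randn(1, 64, 32, 32)   # stand-in for a conv layer output
gram = gram_matrix(features)
print(gram.shape)                       # torch.Size([64, 64])
```

The result is symmetric, and each diagonal entry is a feature map's dot product with itself, which is why the diagonals of the heatmaps below are bright.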

Note that this operation has already been done when constructing the StyleTransfer object, and the Gram Matrices can be retrieved through the object's data property

In [8]:
style_grams = st.data['style_grams']
In [9]:
style_grams
Out[9]:
{'conv1_1': tensor([[ 20186.1445,  17662.2480,  18747.7891,  ...,   4591.7324,
            6172.3379,  17084.3984],
         [ 17662.2480, 229796.2969,  16045.7061,  ..., 113822.8906,
           31240.9551, 106558.4453],
         [ 18747.7891,  16045.7061,  44846.0664,  ...,   1174.8192,
           12487.8350,  26915.5723],
         ...,
         [  4591.7324, 113822.8906,   1174.8192,  ..., 648161.8125,
           92145.4375, 133501.1562],
         [  6172.3379,  31240.9551,  12487.8350,  ...,  92145.4375,
           63765.1562,  49467.1406],
         [ 17084.3984, 106558.4453,  26915.5723,  ..., 133501.1562,
           49467.1406, 215700.7344]], device='cuda:0'),
 'conv2_1': tensor([[316412.7812,  19379.4238, 126752.6641,  ..., 131583.1250,
           99010.3359, 116919.6719],
         [ 19379.4238,  50601.4219,  15324.7578,  ...,  71526.0312,
           40577.9727,  17273.3770],
         [126752.6641,  15324.7578, 173083.6406,  ...,  77164.2188,
           39419.2539,  55382.5859],
         ...,
         [131583.1250,  71526.0312,  77164.2188,  ..., 501056.6562,
          143926.0938,  76526.5078],
         [ 99010.3359,  40577.9727,  39419.2539,  ..., 143926.0938,
          321856.1875, 145728.8594],
         [116919.6719,  17273.3770,  55382.5859,  ...,  76526.5078,
          145728.8594, 256298.4688]], device='cuda:0'),
 'conv3_1': tensor([[125165.3516,  59849.1719,  19542.8184,  ...,  36360.7266,
           30432.0410,  42761.4609],
         [ 59849.1719, 358204.7812,  85755.2031,  ..., 141672.6719,
           82754.8672, 103797.6172],
         [ 19542.8184,  85755.2031, 176579.5469,  ...,  57931.9922,
           57897.1484,  44485.9766],
         ...,
         [ 36360.7266, 141672.6719,  57931.9922,  ..., 530079.3750,
           74985.9766, 138090.6562],
         [ 30432.0410,  82754.8672,  57897.1484,  ...,  74985.9766,
          298433.9375,  69579.9844],
         [ 42761.4609, 103797.6172,  44485.9766,  ..., 138090.6562,
           69579.9844, 385306.1875]], device='cuda:0'),
 'conv4_1': tensor([[ 48371.1641,   7696.0894,   1510.5769,  ...,   4526.6494,
           12110.4980,  12914.2041],
         [  7696.0894, 278378.3125,  23440.6230,  ...,  58121.7031,
           33870.8867,  25198.7695],
         [  1510.5769,  23440.6230,  53501.8789,  ...,   2925.0854,
           22483.7285,   7187.8257],
         ...,
         [  4526.6494,  58121.7031,   2925.0854,  ..., 159275.5938,
           52200.5391,   4879.4053],
         [ 12110.4980,  33870.8867,  22483.7285,  ...,  52200.5391,
          358177.4062,  21872.5234],
         [ 12914.2041,  25198.7695,   7187.8257,  ...,   4879.4053,
           21872.5234,  60930.4688]], device='cuda:0'),
 'conv4_2': tensor([[2.4458e+04, 3.6681e+02, 4.8301e+03,  ..., 2.2787e+03, 2.4650e+02,
          2.3486e+03],
         [3.6681e+02, 3.1574e+03, 2.1593e+02,  ..., 1.0366e+01, 2.6437e+01,
          1.8326e+02],
         [4.8301e+03, 2.1593e+02, 7.3426e+04,  ..., 5.3179e+03, 1.0846e+03,
          6.1677e+03],
         ...,
         [2.2787e+03, 1.0366e+01, 5.3179e+03,  ..., 3.6832e+04, 3.4723e+02,
          8.5764e+03],
         [2.4650e+02, 2.6437e+01, 1.0846e+03,  ..., 3.4723e+02, 3.7501e+03,
          2.5981e+02],
         [2.3486e+03, 1.8326e+02, 6.1677e+03,  ..., 8.5764e+03, 2.5981e+02,
          4.1481e+04]], device='cuda:0'),
 'conv5_1': tensor([[1.0305e+03, 1.2176e+03, 5.8051e+00,  ..., 2.8116e+02, 6.1422e+01,
          1.4910e+02],
         [1.2176e+03, 3.9585e+04, 6.9854e+02,  ..., 2.9981e+03, 1.0798e+03,
          1.0487e+03],
         [5.8051e+00, 6.9854e+02, 8.4804e+02,  ..., 1.8966e+02, 7.1923e+02,
          1.1748e+02],
         ...,
         [2.8116e+02, 2.9981e+03, 1.8966e+02,  ..., 4.4794e+03, 1.9319e+02,
          2.6045e+02],
         [6.1422e+01, 1.0798e+03, 7.1923e+02,  ..., 1.9319e+02, 3.4795e+03,
          3.8253e+01],
         [1.4910e+02, 1.0487e+03, 1.1748e+02,  ..., 2.6045e+02, 3.8253e+01,
          1.0325e+03]], device='cuda:0')}

Below are heatmaps of the computed gram matrices in each convolutional layer.

Note that most values on the diagonal are higher (brighter color), which makes sense because each diagonal entry is the dot product of a feature map with itself!

However, notice that some cells that are not on the diagonal also have a bright color. Those indicate two feature maps that are very similar. (Note that the deeper layers have more filters, so you might need to open the plot outside the notebook to see those cells)

In [10]:
# Create a heatmap of a computed gram matrix in each convolutional layer
st.showStyleRepresentations()

Let's inspect the filter outputs that are used to compute a Gram Matrix

Output of filters in the 'conv1_1' layer

Generally, the earlier layers are used to detect the LARGER style artifacts, as you can see in the images of each filter's output in the conv1_1 layer below

In [11]:
st.showStyleFiltersAtLayer('conv1_1')

Let's inspect the computed Gram Matrix for this layer. As mentioned earlier, the top values are mostly those on the diagonal, but some off-diagonal cells (e.g. (46, 61), (61, 60), (24, 54)) have even larger values than many of the diagonal entries.

In [12]:
st.showStyleRepresentations(layer_names=['conv1_1'])
In [13]:
# Get the top 20 cells
out = st.getGramMatrixIndicesSortedDescendingly('conv1_1')
out[:20]
Out[13]:
[(61, 61),
 (24, 24),
 (46, 61),
 (61, 46),
 (61, 60),
 (60, 61),
 (24, 54),
 (54, 24),
 (42, 61),
 (61, 42),
 (11, 11),
 (34, 61),
 (61, 34),
 (54, 54),
 (34, 34),
 (46, 46),
 (30, 30),
 (14, 24),
 (24, 14),
 (60, 60)]
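The exact implementation of getGramMatrixIndicesSortedDescendingly() isn't shown here, but a ranking like Out[13] can be sketched by flattening the Gram matrix and sorting its entries (the function name and toy data below are illustrative):

```python
import torch

def top_gram_indices(gram, k=20):
    """(row, col) indices of the k largest entries of a square Gram matrix."""
    order = torch.argsort(gram.flatten(), descending=True)
    n = gram.size(0)
    return [(int(i) // n, int(i) % n) for i in order[:k]]

# Toy Gram matrix: make feature maps 1 and 3 identical so cell (1, 3) ranks high
maps = torch.randn(4, 16)
maps[3] = maps[1]
gram = maps @ maps.t()
print(top_gram_indices(gram, k=6))
```

Because the matrix is symmetric, the indices come out in (i, j)/(j, i) pairs, exactly as in the Out[13] listing above.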

Below are the output images of the filters with the top values (excluding the diagonal) in the Gram Matrix.

You can easily see that those output images have very similar colors and textures!

In [14]:
st.showTopMatchedStyleFilters('conv1_1')

Let's repeat the same process for the conv2_1, conv3_1, conv4_1, and conv5_1 layers

As you can see in the images below, as we go deeper in the network, the convolutional layers detect and emphasize SMALLER features!

In [15]:
## Inspect the "conv2_1" layer
st.showStyleFiltersAtLayer('conv2_1')
In [16]:
st.showTopMatchedStyleFilters('conv2_1')
In [17]:
## Inspect the "conv3_1" layer
st.showStyleFiltersAtLayer('conv3_1')
In [18]:
st.showTopMatchedStyleFilters('conv3_1')
In [19]:
## Inspect the "conv4_1" layer
st.showStyleFiltersAtLayer('conv4_1')
In [20]:
st.showTopMatchedStyleFilters('conv4_1')
In [21]:
## Inspect the "conv5_1" layer
st.showStyleFiltersAtLayer('conv5_1')
In [22]:
st.showTopMatchedStyleFilters('conv5_1')

Conclusion

  • To extract the Content Representation, we can forward-pass a content image through the VGG19 network and retrieve the output of the conv4_2 layer
  • To extract the Style Representations, we can forward-pass a style image through the VGG19 network, retrieve the outputs of the conv1_1, conv2_1, conv3_1, conv4_1, and conv5_1 layers, and compute a Gram Matrix for each (the Gram Matrices are our style representations)
  • A Gram Matrix is a mathematical way of representing the prominent styles shared (the similarities) between feature maps
  • As we go deeper in the network, the convolutional layers emphasize detecting smaller features compared to the earlier layers, and we can adjust a weight value for each layer depending on how much we would like the larger or smaller style artifacts to be applied to the target image
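The last point is typically implemented as a per-layer weight dictionary inside the style loss. The weights below are illustrative, not the ones used in the experiment; larger weights on the earlier layers favor the larger style artifacts:

```python
import torch

# Illustrative weights: emphasizing earlier layers -> larger style artifacts
style_weights = {'conv1_1': 1.0, 'conv2_1': 0.8, 'conv3_1': 0.5,
                 'conv4_1': 0.3, 'conv5_1': 0.1}

def weighted_style_loss(target_grams, style_grams, weights):
    """Weighted sum of mean-squared differences between Gram matrices."""
    return sum(w * torch.mean((target_grams[layer] - style_grams[layer]) ** 2)
               for layer, w in weights.items())
```

During optimization, this loss (scaled by an overall style weight) is added to the content loss, i.e. the mean-squared difference between the target's and the content image's conv4_2 outputs.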